{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "DataAnalysis"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "import nltk\n",
    "from tools import show_subtitle\n",
    "%matplotlib inline"
   ]
  },
  {
   "source": [
    "# Chap 3 处理原始文本\n",
    "1.  如何访问文件内的文本？\n",
    "2.  如何将文档分割成单独的单词和标点符号，从而进行文本语料上的分析？\n",
    "3.  如何产生格式化的输出，并把结果保存在文件中？"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 3.7. 用正则表达式为文本分词(P118)\n",
    "分词（Tokenization）：是将字符串切割成可以识别的构成语言数据的语言单元。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 3.7.1 分词的简单方法\n",
    "\n",
    "P109 表3-3 正则表达式基本元字符，P120 表3-4 正则表达式符号"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw = \"\"\"'When I'M a Duchess,' she said to herself, (not in a very hopeful \n",
    "tone though), 'I won't have any pepper in my kitchen AT ALL. Soup does very \n",
    "well without--Maybe it's always pepper that makes people hot-tempered,'...\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[\"'When\", \"I'M\", 'a', \"Duchess,'\", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', '\\ntone', 'though),', \"'I\", \"won't\", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', '\\nwell', 'without--Maybe', \"it's\", 'always', 'pepper', 'that', 'makes', 'people', \"hot-tempered,'...\"]\n"
     ]
    }
   ],
   "source": [
    "print(re.split(r' ', raw))  # 利用空格分词，没有去除'\\t'和'\\n'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[\"'When\", \"I'M\", 'a', \"Duchess,'\", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', \"'I\", \"won't\", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', \"it's\", 'always', 'pepper', 'that', 'makes', 'people', \"hot-tempered,'...\"]\n"
     ]
    }
   ],
   "source": [
    "print(re.split(r'[ \\t\\n]+', raw))  # 利用空格、'\\t'和'\\n'分词，但是不能去除标点符号"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[\"'When\", \"I'M\", 'a', \"Duchess,'\", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', '', 'tone', 'though),', \"'I\", \"won't\", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', '', 'well', 'without--Maybe', \"it's\", 'always', 'pepper', 'that', 'makes', 'people', \"hot-tempered,'...\"]\n"
     ]
    }
   ],
   "source": [
    "print(re.split(r'\\s', raw))  # 使用re库内置的'\\s'（匹配所有空白字符）分词，但是不能去除标点符号"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered', '']\n"
     ]
    }
   ],
   "source": [
    "print(re.split(r'\\W+', raw))  # 利用所有字母、数字和下划线以外的字符来分词，但是将“I'm”、“won't”这样的单词拆分了"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[\"'When\", 'I', \"'M\", 'a', 'Duchess', ',', \"'\", 'she', 'said', 'to', 'herself', ',', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', \"'I\", 'won', \"'t\", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '-', '-Maybe', 'it', \"'s\", 'always', 'pepper', 'that', 'makes', 'people', 'hot', '-tempered', ',', \"'\", '.', '.', '.']\n"
     ]
    }
   ],
   "source": [
    "print(re.findall(r'\\w+|\\S\\w*', raw))  # 使用findall()分词，可以将标点保留，不会出现空字符串"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[\"'\", 'When', \"I'M\", 'a', 'Duchess', ',', \"'\", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', \"'\", 'I', \"won't\", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', \"it's\", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', \"'\", '...']\n"
     ]
    }
   ],
   "source": [
    "print(re.findall(r\"\\w+(?:[-']\\w+)*|'|[-.(]+|\\S\\w*\", raw))  # 利用规则使分词更加准确"
   ]
  },
  {
   "source": [
    "### 3.7.2 NLTK 的正则表达式分词器(P120)"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']"
      ]
     },
     "metadata": {},
     "execution_count": 9
    }
   ],
   "source": [
    "text = 'That U.S.A. poster-print costs $12.40...'\n",
    "pattern = r'''(?x)    # set flag to allow verbose regexps\n",
    "    (?:[A-Z]\\.)+        # abbreviations, e.g. U.S.A.\n",
    "  | \\w+(?:-\\w+)*        # words with optional internal hyphens\n",
    "  | \\$?\\d+(?:\\.\\d+)?%?  # currency and percentages, e.g. $12.40, 82%\n",
    "  | \\.\\.\\.            # ellipsis\n",
    "  | [][.,;\"'?():-_`]  # these are separate tokens; includes ], [\n",
    "'''\n",
    "nltk.regexp_tokenize(text, pattern)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "'(?x)'=  ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']\n"
     ]
    }
   ],
   "source": [
    "print(\"'(?x)'= \",nltk.regexp_tokenize(text, '(?x)'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "'([A-Z]\\.)'=  ['U.', 'S.', 'A.']\n'([A-Z]\\.)+'=  ['A.']\n'(?:[A-Z]\\.)+'=  ['U.S.A.']\n"
     ]
    }
   ],
   "source": [
    "print(\"'([A-Z]\\.)'= \", nltk.regexp_tokenize(text, '([A-Z]\\.)'))\n",
    "print(\"'([A-Z]\\.)+'= \", nltk.regexp_tokenize(text, '([A-Z]\\.)+'))\n",
    "print(\"'(?:[A-Z]\\.)+'= \", nltk.regexp_tokenize(text, '(?:[A-Z]\\.)+'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "'\\w'=  ['T', 'h', 'a', 't', 'U', 'S', 'A', 'p', 'o', 's', 't', 'e', 'r', 'p', 'r', 'i', 'n', 't', 'c', 'o', 's', 't', 's', '1', '2', '4', '0']\n'\\w+'=  ['That', 'U', 'S', 'A', 'poster', 'print', 'costs', '12', '40']\n'\\w(\\w)'=  ['h', 't', 'o', 't', 'r', 'r', 'n', 'o', 't', '2', '0']\n'\\w+(\\w)'=  ['t', 'r', 't', 's', '2', '0']\n'\\w(-\\w)'=  ['-p']\n'\\w+(-\\w)'=  ['-p']\n'\\w(-\\w+)'=  ['-print']\n'\\w+(-\\w+)'=  ['-print']\n'\\w(-\\w+)*'=  ['', '', '', '', '', '', '', '', '', '', '', '', '-print', '', '', '', '', '', '', '', '', '']\n'\\w+(-\\w+)*'=  ['', '', '', '', '-print', '', '', '']\n"
     ]
    }
   ],
   "source": [
    "print(\"'\\w'= \",nltk.regexp_tokenize(text, '\\w'))  \n",
    "print(\"'\\w+'= \",nltk.regexp_tokenize(text, '\\w+'))  \n",
    "print(\"'\\w(\\w)'= \",nltk.regexp_tokenize(text, '\\w(\\w)'))# 每连续两个单词标准的字母，取后面那个字母\n",
    "print(\"'\\w+(\\w)'= \",nltk.regexp_tokenize(text, '\\w+(\\w)'))# 每个单词，取最后那个字母\n",
    "print(\"'\\w(-\\w)'= \",nltk.regexp_tokenize(text, '\\w(-\\w)'))\n",
    "print(\"'\\w+(-\\w)'= \",nltk.regexp_tokenize(text, '\\w+(-\\w)'))\n",
    "print(\"'\\w(-\\w+)'= \",nltk.regexp_tokenize(text, '\\w(-\\w+)'))\n",
    "print(\"'\\w+(-\\w+)'= \",nltk.regexp_tokenize(text, '\\w+(-\\w+)'))\n",
    "print(\"'\\w(-\\w+)*'= \",nltk.regexp_tokenize(text, '\\w(-\\w+)*'))\n",
    "print(\"'\\w+(-\\w+)*'= \",nltk.regexp_tokenize(text, '\\w+(-\\w+)*'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "'\\w+(?:)'))=  ['That', 'U', 'S', 'A', 'poster', 'print', 'costs', '12', '40']\n'\\w+(?:)+'))=  ['That', 'U', 'S', 'A', 'poster', 'print', 'costs', '12', '40']\n'\\w+(?:\\w)'))=  ['That', 'poster', 'print', 'costs', '12', '40']\n'\\w+(?:\\w+)'))=  ['That', 'poster', 'print', 'costs', '12', '40']\n'\\w+(?:\\w)*'))=  ['That', 'U', 'S', 'A', 'poster', 'print', 'costs', '12', '40']\n'\\w+(?:\\w+)*'))=  ['That', 'U', 'S', 'A', 'poster', 'print', 'costs', '12', '40']\n"
     ]
    }
   ],
   "source": [
    "print(\"'\\w+(?:)'))= \",nltk.regexp_tokenize(text, '\\w+(?:)'))  \n",
    "print(\"'\\w+(?:)+'))= \",nltk.regexp_tokenize(text, '\\w+(?:)+'))  \n",
    "print(\"'\\w+(?:\\w)'))= \",nltk.regexp_tokenize(text, '\\w+(?:\\w)'))  \n",
    "print(\"'\\w+(?:\\w+)'))= \",nltk.regexp_tokenize(text, '\\w+(?:\\w+)'))  \n",
    "print(\"'\\w+(?:\\w)*'))= \",nltk.regexp_tokenize(text, '\\w+(?:\\w)*'))  \n",
    "print(\"'\\w+(?:\\w+)*'))= \",nltk.regexp_tokenize(text, '\\w+(?:\\w+)*'))  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "'\\.\\.\\.'=  ['...']\n"
     ]
    }
   ],
   "source": [
    "print(\"'\\.\\.\\.'= \", nltk.regexp_tokenize(text, '\\.\\.\\.'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "'\\.\\.\\.|([A-Z]\\.)+'=  ['U.S.A.', '...']\n"
     ]
    }
   ],
   "source": [
    "print(\"'\\.\\.\\.|([A-Z]\\.)+'= \", nltk.regexp_tokenize(text, '\\.\\.\\.|(?:[A-Z]\\.)+'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >\\w+ (\\d+) \\w+ (\\d+) \\w+ (\\d+)< ---------------\n[('123', '456', '789')]\n--------------- >\\w+ (\\d+) \\w+ (?:\\d+) \\w+ (\\d+)< ---------------\n[('123', '789')]\n"
     ]
    }
   ],
   "source": [
    "# (?:) 非捕捉组用法对比\n",
    "inputStr = \"hello 123 world 456 nihao 789\"\n",
    "rePatternAllCapturingGroup = \"\\w+ (\\d+) \\w+ (\\d+) \\w+ (\\d+)\"\n",
    "rePatternWithNonCapturingGroup = \"\\w+ (\\d+) \\w+ (?:\\d+) \\w+ (\\d+)\"\n",
    "show_subtitle(rePatternAllCapturingGroup)\n",
    "print(nltk.regexp_tokenize(inputStr, rePatternAllCapturingGroup))\n",
    "show_subtitle(rePatternWithNonCapturingGroup)\n",
    "print(nltk.regexp_tokenize(inputStr, rePatternWithNonCapturingGroup))"
   ]
  },
  {
   "source": [
    "### 3.7.3 进一步讨论分词\n",
    "分词：比预期更为艰巨，没有任何单一的解决方案可以在所有领域都行之有效。\n",
    "\n",
    "在开发分词器时，访问已经手工飘游好的原始文本则理有好处，可以将分词器的输出结果与高品质(也叫「黄金标准」)的标注进行比较。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  }
 ]
}